Abstract

We discuss a multiple-play multi-armed bandit (MAB)
problem in which several arms are selected at each round.
Recently, Thompson sampling (TS), a randomized algorithm
with a Bayesian spirit, has attracted much attention for
its empirically excellent performance, and it has been
shown to have an optimal regret bound in the standard
single-play MAB problem. In this paper, we propose the
multiple-play Thompson sampling (MP-TS) algorithm, an
extension of TS to the multiple-play MAB problem, and
discuss its regret analysis. We prove that MP-TS for
binary rewards has the optimal regret upper bound that
matches the regret lower bound provided by Anantharam
et al. (1987). Therefore, MP-TS is the first
computationally efficient algorithm with optimal regret.
A set of computer simulations was also conducted, which
compared MP-TS with state-of-the-art algorithms. We also
propose a modification of MP-TS, which is shown to have
better empirical performance.


1. Introduction 

The multi-armed bandit (MAB) problem is one of the most
well-known instances of sequential decision-making
problems in uncertain environments, which can model many
real-world scenarios. The problem involves conceptual
entities called arms. At each round, the forecaster draws
one of K arms and receives a corresponding reward. The aim
of the forecaster is to maximize the cumulative reward over
rounds, and the forecaster's performance is usually
measured by a regret, which is the gap between his or her
cumulative reward and that of an optimal drawing policy.

Throughout the rounds, the forecaster faces an “exploration
vs. exploitation” dilemma. On one hand, the forecaster
wants to exploit the information that he or she has
gathered up to the previous round by selecting seemingly
good arms. On the other hand, there is always a possibility
that the other arms have been underestimated, which
motivates him or her to explore seemingly bad arms in order
to gather their information. To resolve this dilemma, the
forecaster uses an algorithm to control the number of draws
for each arm.

In the stochastic MAB problem, which is the most widely
studied version of the MAB problem, it is assumed that
each arm is associated with a distinct probability
distribution. While there have been many theoretical
studies on the infinite setting in which future rewards are
geometrically discounted (e.g., the Gittins index (Gittins
& Jones, 1974)), recent availability of massive data has
led to a finite horizon setting in which every reward has
the same importance. In this work, we focus on the latter
setting.


There has been significant progress in this setting of the
MAB problem. In particular, the upper confidence bound
(UCB) algorithm (Auer et al., 2002) has been widely used
and studied for its computational simplicity and
customizability. Whereas the coefficient of the leading
logarithmic term in UCB is larger than the theoretical
lower bound given by Lai & Robbins (1985), algorithms have
been proposed that achieve this bound, such as DMED (Honda
& Takemura, 2010) and KL-UCB (Cappé et al., 2013).

Moreover, Thompson sampling (TS) (Thompson, 1933) has
recently attracted attention for its excellent performance
(Scott, 2010; Chapelle & Li, 2011) and it has been shown
to be applicable to an even wider class of problems
(Agrawal & Goyal, 2013a; Russo & Roy, 2013; Osband et al.,
2013; Kocák et al., 2014; Guha & Munagala, 2014).
Thompson sampling is an old heuristic that has a spirit of
Bayesian inference and selects an arm based on posterior
samples of the expectation of each arm. It has been shown
that TS has an optimal regret bound (Agrawal & Goyal,
2012; Kaufmann et al., 2012; Agrawal & Goyal, 2013b).


1.1. Multiple-play MAB problem 

The literature mentioned above has specifically dealt with
the MAB problem in which a single arm is selected and
drawn at each round. Let us call this problem single-play
MAB (SP-MAB). While the SP-MAB problem is indisputably
important as a canonical problem, in many practical
situations multiple entities corresponding to arms are
selected at each round. We call the MAB problem in
which several arms can be selected multiple-play MAB
(MP-MAB). Examples of situations that can be modeled as an
MP-MAB problem include the following.

• Example 1 (placement of online advertisements): a
web site has several slots where advertisements can
be placed. Based on each user's query, there is a set
of candidates of relevant advertisements from which
web sites can select to display. The effectiveness of
advertisements varies: some advertisements are more
appealing to the user than others. With the standard
model in online advertising, it is assumed that each
advertisement is associated with a click-through-rate
(CTR), which is the number of clicks per view. Since
web sites receive revenue from clicks on
advertisements, it is natural to maximize it, which can
be considered as an instance of an MP-MAB problem in
which advertisements and clicks correspond to arms
and rewards, respectively.

• Example 2 (channel selection in cognitive radio
networks (Huang et al., 2008)): a cognitive radio
is an adaptive scheme for allocating channels, such
as wireless network spectrums. There are two kinds
of users: primary and secondary. Unlike primary
users, secondary users do not have primary access to
a channel but can take advantage of the vacancies in
primary access and opportunistically exploit
instantaneous spectrum availability when primary users
are idle. However, the availabilities of channels are not
easily known. Usually, secondary users have access
to multiple channels. They can enhance their
communication efficiency by adaptively estimating the
availability statistics of the channels, which can be
considered as an MP-MAB problem in which channels
and the permission of communication are arms and
rewards, respectively.


There have been several studies on the MP-MAB problem.
Anantharam et al. (1987) derived an asymptotic lower
bound on the regret for this problem and proposed an
algorithm to achieve this bound. Because their algorithm
requires certain statistics that are difficult to compute,
efficiently computable MP-MAB algorithms have also been
extensively studied. Chen et al. (2013) extended a
UCB-based algorithm to a multiple-play case with
combinatorial rewards and Gopalan et al. (2014) extended
TS to a wide class of problems. Although both papers
provide a logarithmic regret bound, the constant factors of
these regret bounds do not match the lower bound.
Therefore, it is unknown whether the optimal regret bound
for the MP-MAB problem is achievable by using a
computationally efficient algorithm.


The main difficulty in analyzing the MP-MAB problem
lies in the fact that the regret depends on the
combinatorial structure of arm draws. More specifically,
unlike in the SP-MAB problem, an algorithm with the optimal
bound on the number of draws of suboptimal arms does not
always ensure the optimal regret bound.


Contribution: Our contributions are as follows.


• TS-based algorithm for the MP-MAB problem and
its optimal regret bound: the first and main
contribution of this paper is an extension of TS to the
multiple-play case, which we call MP-TS. We prove that
MP-TS for binary rewards achieves an optimal regret
bound. To the best of our knowledge, this paper is
the first to provide a computationally efficient
algorithm in the MP-MAB problem with the optimal regret
bound by Anantharam et al. (1987).


• Novel analysis technique: to solve the difficulty in
the combinatorial structure of the MP-MAB problem,
we show that the independence of posterior samples
among arms in TS is a key property for suppressing
the number of simultaneous draws of several suboptimal
arms, and the use of this property eventually leads
to the optimal regret bound.


• Experimental comparison among MP-MAB algorithms:
we compare MP-TS with other algorithms,
and confirm its efficiency. We also propose an
empirical improvement of MP-TS (IMP-TS) motivated
by analyses on the regret structure of the MP-MAB
problem. We confirm that IMP-TS improves the
performance of MP-TS without increasing computational
complexity.


2. Problem Setup 

Let there be $K$ arms. Each arm $i \in [K] = \{1, 2, \dots, K\}$
is associated with a probability distribution
$\nu_i = \mathrm{Bernoulli}(\mu_i)$, $\mu_i \in (0, 1)$. At each round
$t = 1, 2, \dots, T$, the forecaster selects a set of $L < K$ arms
$I(t)$, then receives the rewards of the selected arms. The
reward $X_i(t)$ of each selected arm $i$ is an i.i.d. sample from
$\nu_i$. Let $N_i(t)$ be the number of draws of arm $i$ before round
$t$ (i.e., $N_i(t) = \sum_{t'=1}^{t-1} \mathbf{1}\{i \in I(t')\}$, where
$\mathbf{1}\{A\} = 1$ if event $A$ holds and $= 0$ otherwise), and
$\hat{\mu}_i(t)$ be the empirical mean of the rewards of arm $i$ at the
beginning of round $t$. The forecaster is interested in maximizing
the sum of rewards over drawn arms. For simplicity, we assume
that all arms have distinct expected rewards (i.e., $\mu_i \neq \mu_j$
for any $i \neq j$). We discuss the case in which $\mu_i = \mu_j$
for some $i$ and $j$ in Appendix A.1, which is in the Supplementary
Material. Without loss of generality, we assume
$\mu_1 > \mu_2 > \mu_3 > \dots > \mu_K$. Of course, algorithms do
not exploit this ordering. We define optimal arms as the top-$L$
arms (i.e., arms $[L]$), and suboptimal arms as the others
(i.e., arms $[K] \setminus [L]$). The regret, which is the expected loss
of the forecaster, is defined as


$$\mathrm{Reg}(T) = \sum_{t=1}^{T} \Biggl( \sum_{i \in [L]} \mu_i - \sum_{i \in I(t)} \mu_i \Biggr).$$

The expectation of regret $\mathbb{E}[\mathrm{Reg}(T)]$ is used to measure the
performance of an algorithm.

3. Regret Bounds

In this section we introduce the known lower bounds of
the regret for the SP-MAB and MP-MAB problems and
discuss the relation between them.

3.1. Regret bound for SP-MAB problem

The SP-MAB problem, which has been thoroughly studied
in the fields of statistics and machine learning, is a special
case of the MP-MAB problem with $L = 1$. The optimal
regret bound in the SP-MAB problem was given by Lai &
Robbins (1985). They proved that, for any strongly
consistent algorithm (i.e., algorithms with subpolynomial
regret for any set of arms), there exists a lower bound

$$\mathbb{E}[N_i(T+1)] \ge \frac{1 - o(1)}{d(\mu_i, \mu_1)} \log T, \tag{1}$$

where $d(p, q) = p \log(p/q) + (1-p) \log((1-p)/(1-q))$
is the KL divergence between two Bernoulli distributions
with expectations $p$ and $q$. Note that when arm $i$ is drawn,
the regret increases by $\Delta_{i,1}$ and the regret is written as

$$\mathbb{E}[\mathrm{Reg}(T)] = \sum_{i \neq 1} \Delta_{i,1}\, \mathbb{E}[N_i(T+1)], \tag{2}$$

where $\Delta_{i,j} = \mu_j - \mu_i$. Therefore, inequality (1) directly
leads to the regret lower bound

$$\mathbb{E}[\mathrm{Reg}(T)] \ge \sum_{i \neq 1} \frac{(1 - o(1))\,\Delta_{i,1}}{d(\mu_i, \mu_1)} \log T. \tag{3}$$

One may think that applying the techniques of the SP-MAB
problem would directly yield an optimal bound for a more
general MP-MAB problem. However, this is not the case.
In short, the difficulty in analyzing the regret on the
MP-MAB problem arises from the fact that the optimal bound
on the number of suboptimal arm draws does not directly
lead to the optimal regret. From this point forward, we
focus on the MP-MAB problem in which $L$ is not restricted
to one.

[Figure 1: an MP-MAB instance with K = 4, L = 2, and
$\mu_1 = 0.10$, $\mu_2 = 0.09$ (optimal arms), $\mu_3 = 0.08$, $\mu_4 = 0.07$ (suboptimal arms).]

          Game 1                            Game 2
t = 1     I(1) = {1, 2}   (r(1) = 0)        I(1) = {1, 3}   (r(1) = 0.01)
t = 2     I(2) = {3, 4}   (r(2) = 0.04)     I(2) = {1, 4}   (r(2) = 0.02)
          Regret(2) = 0.04                  Regret(2) = 0.03

Figure 1. Two bandit games with the same set of arms. $r(t)$ is
defined as the increase in the regret at round $t$. In both games
1 and 2, we have the same number of suboptimal arm draws
($N_3(2) = N_4(2) = 1$). However, the regrets in games 1 and 2
are different.
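The two games of Figure 1 can be checked numerically. The following is a minimal sketch, assuming 0-indexed arms and a hypothetical helper `regret` that simply evaluates the definition of $\mathrm{Reg}(T)$ from Section 2 for a given sequence of selections.

```python
def regret(mu, L, selections):
    """Cumulative regret of a sequence of selected arm sets (arms are 0-indexed)."""
    # Sum of the expected rewards of the top-L arms.
    best = sum(sorted(mu, reverse=True)[:L])
    # Each round adds the gap between the optimal set and the selected set.
    return sum(best - sum(mu[i] for i in chosen) for chosen in selections)


mu = [0.10, 0.09, 0.08, 0.07]   # arm means from Figure 1
game1 = [{0, 1}, {2, 3}]        # I(1) = {1, 2}, I(2) = {3, 4}
game2 = [{0, 2}, {0, 3}]        # I(1) = {1, 3}, I(2) = {1, 4}

print(round(regret(mu, 2, game1), 2))  # 0.04
print(round(regret(mu, 2, game2), 2))  # 0.03
```

Both games draw arms 3 and 4 once each, yet the regrets differ because Game 1 draws them in the same round and therefore pushes out arm 1 as well as arm 2.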


3.2. Extension to MP-MAB problem 


The regret lower bound in the MP-MAB problem, which is
the generalization of inequality (1), was provided by
Anantharam et al. (1987). They first proved that, for any
strongly consistent algorithm and suboptimal arm $i$, the
number of arm $i$ draws is lower-bounded as

$$\mathbb{E}[N_i(T+1)] \ge \frac{1 - o(1)}{d(\mu_i, \mu_L)} \log T. \tag{4}$$


Unlike in the SP-MAB problem, the regret in the MP-MAB
problem is not uniquely determined by the number of
suboptimal arm draws. As illustrated in Figure 1, the regret
is dependent on the combinatorial structure of arm draws.

Recall that a regret increase at each round is the gap of
expected rewards between the optimal arms and that of the
selected arms. When a suboptimal arm is selected, one
optimal arm is excluded from $I(t)$ instead of the suboptimal
arm. Let the selected suboptimal arm and excluded optimal
arm be $i$ and $j$, respectively. Then, we lose expected
reward $\mu_j - \mu_i$. Namely, the loss in the expected reward at
each round is given by

$$\sum_{j \in [L]} \mu_j - \sum_{i \in I(t)} \mu_i
= \sum_{j \in [L] \setminus I(t)} \mu_j - \sum_{i \in I(t) \setminus [L]} \mu_i
\ge \sum_{i \in I(t) \setminus [L]} (\mu_L - \mu_i), \tag{5}$$

where we used the fact $\mu_j \ge \mu_L$ for any optimal arm $j$.
From this relation, the regret is expressed as

$$\mathrm{Reg}(T) \ge \sum_{t=1}^{T} \sum_{i \in I(t) \setminus [L]} (\mu_L - \mu_i)
= \sum_{i \in [K] \setminus [L]} \Delta_{i,L}\, N_i(T+1), \tag{6}$$

which, combined with (4), leads to the regret lower bound
by Anantharam et al. (1987) that any strongly consistent
algorithm satisfies

$$\mathbb{E}[\mathrm{Reg}(T)] \ge \sum_{i \in [K] \setminus [L]} \frac{(1 - o(1))\,\Delta_{i,L}}{d(\mu_i, \mu_L)} \log T. \tag{7}$$

Algorithm 1 Multiple-play Thompson sampling (MP-TS)
for binary rewards

  Input: # of arms $K$, # of selections $L$
  for $i = 1, 2, \dots, K$ do
    $A_i, B_i = 1, 1$
  end for
  $t \leftarrow 1$.
  for $t = 1, 2, \dots, T$ do
    for $i = 1, 2, \dots, K$ do
      $\theta_i(t) \sim \mathrm{Beta}(A_i, B_i)$
    end for
    $I(t)$ = top-$L$ arms ranked by $\theta_i(t)$.
    for $i \in I(t)$ do
      if $X_i(t) = 1$ then
        $A_i \leftarrow A_i + 1$
      else
        $B_i \leftarrow B_i + 1$
      end if
    end for
  end for
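To get a feel for the size of the leading constant in (7), the coefficient of $\log T$ can be evaluated directly from the arm means. The sketch below is illustrative only; the function names are hypothetical, and the example means are those of Scenario 1 in Section 7.

```python
import math


def kl_bernoulli(p, q):
    """KL divergence d(p, q) between Bernoulli(p) and Bernoulli(q), for p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def lower_bound_coefficient(mu, L):
    """Coefficient of log T in (7): sum over suboptimal arms of Delta_{i,L} / d(mu_i, mu_L)."""
    ordered = sorted(mu, reverse=True)
    mu_L = ordered[L - 1]  # smallest mean among the optimal arms
    return sum((mu_L - m) / kl_bernoulli(m, mu_L) for m in ordered[L:])


coef = lower_bound_coefficient([0.7, 0.6, 0.5, 0.4, 0.3], L=2)
print(coef)  # roughly 9.0 for these means
```

Note that the arm closest to arm $L$ dominates the sum, since $d(\mu_i, \mu_L)$ shrinks quadratically in the gap while $\Delta_{i,L}$ shrinks only linearly.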


3.3. Necessary condition for an optimal algorithm 

In Sections 3.1 and 3.2 we saw that the derivations of
the regret bounds are analogous between the SP-MAB and
MP-MAB problems. However, there is a difference: the
relation between the regret and $N_i(T)$, the number of draws
of suboptimal arms, is given as equation (2) in the SP-MAB
problem, whereas it is given as inequality (6) in the
MP-MAB problem. This means that an algorithm achieving
the asymptotic lower bound (4) on $N_i(T)$ does not always
achieve the asymptotic regret bound (7).

When suboptimal arm $i$ is selected, one of the optimal arms
is pushed out instead of arm $i$, and the regret increases by
the difference between the expected rewards of these two
arms. The best scenario is that arm $L$, which is the optimal
arm with the smallest expected reward, is almost always
the arm pushed out instead of a suboptimal arm. For this
scenario to occur, it is necessary to ensure that at most one
suboptimal arm is drawn for almost all rounds because, if
two suboptimal arms are selected, at least one arm in $[L-1]$
is pushed out.

In the next section, we propose an extension of TS to the
MP-MAB problem, and explain that it has a crucial
property for suppressing this simultaneous draw of two
suboptimal arms.


Remark: Corollary 1 of Gopalan et al. (2014) shows the
achievability of the bound in the RHS of (4) on the
number of draws of suboptimal arms. Whereas this does not
lead to the optimal regret bound as discussed above, they
originally derived in Theorem 1 an $O(\log T)$ bound on the
number of each suboptimal action (that is, each
combination of arms including suboptimal ones) for a more
general setting of MP-MAB. Thus, we can directly use this
bound to derive a better regret bound. However, to show the
optimality in the sense of regret it is necessary to prove
that there are at most $o(\log T)$ rounds such that an arm in
$[L-1]$ is pushed out. Therefore, it still requires further
discussion to derive the optimal regret bound of TS. Note
also that the regret bound by Gopalan et al. (2014) is
restricted to the case that the prior has a finite support
and the true parameter is in the support, and thus their
analysis requires some approximation scheme for dealing
with Bernoulli rewards.


4. Multiple-play Thompson Sampling Algorithm

Algorithm 1 is our MP-TS algorithm. While TS for
single-play selects the top-1 arm based on a posterior sample
$\theta_i(t)$, MP-TS selects the top-$L$ arms ranked by the
posterior samples. [...] to achieve the optimal regret bound
is to suppress the simultaneous draws of two or more
suboptimal arms, which characterizes the difficulty of the
MP-MAB problem.


Note that it is easy to extend other asymptotically
optimal SP-MAB algorithms, such as KL-UCB, to the
MP-MAB problem. Nevertheless, we were not able to prove
the optimality of these algorithms for the MP-MAB
problem, though the achievability of the bound (4) on $N_i(T)$
is easily proved, and the simulation results in Section 7
also imply their achievability of the regret bound. This is
because TS has quite a plausible property to suppress
simultaneous draws as we discuss below.

Before the exact statement in the next section, we give an
intuition for why the natural extension of TS (or other
asymptotically optimal SP-MAB algorithms) can have the
optimal regret in the MP-MAB problem. Roughly speaking, a
bandit algorithm with a logarithmic regret draws a
suboptimal arm with probability $O(1/t)$ at the $t$-th round,
which amounts to $\sum_{t=1}^{T} O(1/t) = O(\log T)$ regret. Thus,
two suboptimal arms are drawn at the same round with
probability $O(1/t^2)$, which amounts to
$\sum_{t=1}^{T} O(1/t^2) = O(1)$ total simultaneous draws, provided
that each suboptimal arm is selected independently.
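The contrast between the two sums in this heuristic argument is easy to see numerically. The following sketch (the helper name `partial_sums` is an assumption made for illustration) compares $\sum_t 1/t$, which grows like $\log T$, with $\sum_t 1/t^2$, which stays below $\pi^2/6$ for every $T$.

```python
import math


def partial_sums(T):
    """Return (sum_{t<=T} 1/t, sum_{t<=T} 1/t^2)."""
    single = sum(1.0 / t for t in range(1, T + 1))
    double = sum(1.0 / t ** 2 for t in range(1, T + 1))
    return single, double


for T in (10 ** 2, 10 ** 4, 10 ** 6):
    single, double = partial_sums(T)
    # The first sum tracks log T; the second is bounded by pi^2 / 6.
    print(T, round(single, 3), round(math.log(T), 3), round(double, 4))
```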

In TS, the score $\theta_i(t)$ for the choice of arms is generated
randomly at each round from the posterior, independently
between arms, which enables us to bound simultaneous
draws as in the above intuition. On the other hand, in
KL-UCB (or in other index policies), the UCB score for
the choice of arms is deterministic given the past results of
rewards, which means that the scores of suboptimal arms
may behave quite similarly in the worst case on the past
rewards.
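As a concrete illustration, Algorithm 1 can be simulated in a few lines. This is a minimal sketch rather than the authors' implementation: the function name `mp_ts`, the simulated Bernoulli environment, and the use of Python's standard `random` module are all assumptions made here.

```python
import random


def mp_ts(mu, L, T, seed=0):
    """Simulate MP-TS (Algorithm 1) on Bernoulli arms with means mu.

    Returns the cumulative regret and the final Beta parameters (A, B).
    """
    rng = random.Random(seed)
    K = len(mu)
    A = [1] * K  # Beta(1, 1) prior, i.e. uniform, on every arm
    B = [1] * K
    best = sum(sorted(mu, reverse=True)[:L])
    regret = 0.0
    for _ in range(T):
        # Independent posterior samples, one per arm.
        theta = [rng.betavariate(A[i], B[i]) for i in range(K)]
        # Select the top-L arms ranked by their posterior samples.
        selected = sorted(range(K), key=lambda i: theta[i], reverse=True)[:L]
        for i in selected:
            if rng.random() < mu[i]:  # simulated reward X_i(t) = 1
                A[i] += 1
            else:
                B[i] += 1
        regret += best - sum(mu[i] for i in selected)
    return regret, A, B


total_regret, A, B = mp_ts([0.7, 0.6, 0.5, 0.4, 0.3], L=2, T=2000)
print(total_regret)
```

Because each $\theta_i(t)$ is drawn independently, two suboptimal arms enter the top-$L$ set together only when both samples happen to be large in the same round, which is the property the analysis exploits.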


5. Optimal Regret Bound 

In this section, we state the main theoretical result
(Theorem 1). The analysis that leads to this theorem is
discussed in Section 6.

Theorem 1. (Regret upper bound of MP-TS) For any
sufficiently small $\epsilon_1 > 0$, $\epsilon_2 > 0$, the regret of MP-TS is
upper-bounded as

$$\mathbb{E}[\mathrm{Reg}(T)] \le \sum_{i \in [K] \setminus [L]} \frac{(1 + \epsilon_1)\,\Delta_{i,L} \log T}{d(\mu_i, \mu_L)}
+ C_a(\epsilon_1, \mu_1, \mu_2, \dots, \mu_K) + C_b(T, \epsilon_2, \mu_1, \mu_2, \dots, \mu_K),$$

where $C_a = C_a(\epsilon_1, \mu_1, \mu_2, \dots, \mu_K)$ is a constant
independent of $T$ and is $O(\epsilon_1^{-2})$ when we regard
$\{\mu_i\}_{i=1}^{K}$ as constants. The value
$C_b = C_b(T, \epsilon_2, \mu_1, \mu_2, \dots, \mu_K)$ is a function of $T$,
which, by choosing proper $\epsilon_2$, grows at a rate of
$O(\log\log T) = o(\log T)$.

By letting $\epsilon_1 = O((\log T)^{-1/3})$ we obtain

$$\mathbb{E}[\mathrm{Reg}(T)] \le \sum_{i \in [K] \setminus [L]} \frac{\Delta_{i,L} \log T}{d(\mu_i, \mu_L)} + O\bigl((\log T)^{2/3}\bigr) \tag{8}$$

and we see that MP-TS achieves the asymptotic bound in (7).
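The step from Theorem 1 to (8) can be spelled out as follows. This is a short worked derivation under the rates stated in the theorem, taking $\epsilon_1 = (\log T)^{-1/3}$ for concreteness.

```latex
\begin{align*}
\epsilon_1 \sum_{i \in [K] \setminus [L]} \frac{\Delta_{i,L} \log T}{d(\mu_i, \mu_L)}
  &= O\bigl((\log T)^{-1/3} \cdot \log T\bigr) = O\bigl((\log T)^{2/3}\bigr), \\
C_a &= O\bigl(\epsilon_1^{-2}\bigr) = O\bigl((\log T)^{2/3}\bigr), \\
C_b &= O(\log \log T) = o\bigl((\log T)^{2/3}\bigr).
\end{align*}
```

The exponent $-1/3$ balances the first two terms: a smaller $\epsilon_1$ shrinks the excess in the leading term but inflates $C_a$, and the two contributions are of the same order exactly when $\epsilon_1 \log T \asymp \epsilon_1^{-2}$.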


Expected regret and high-probability regret: Anantharam
et al. (1987) originally derived a regret lower bound
in a stronger form than (7) such that for any $\epsilon > 0$, the
regret of a strongly consistent algorithm is lower-bounded
as

$$\lim_{T \to \infty} \Pr\Biggl[ \frac{\mathrm{Reg}(T)}{\log T} \ge \sum_{i \in [K] \setminus [L]} \frac{(1 - \epsilon)\,\Delta_{i,L}}{d(\mu_i, \mu_L)} \Biggr] = 1.$$

Combining this with (8) we can easily see that MP-TS
satisfies

$$\lim_{T \to \infty} \Pr\Biggl[ \frac{\mathrm{Reg}(T)}{\log T} \le \sum_{i \in [K] \setminus [L]} \frac{(1 + \epsilon)\,\Delta_{i,L}}{d(\mu_i, \mu_L)} \Biggr] = 1, \tag{9}$$

that is, MP-TS is also asymptotically optimal in the sense
of high probability. Since an algorithm satisfying (9) is not
always optimal in the sense of expectation, our result, the
expected optimal regret bound, is also stronger in this sense
than the high-probability bound by Gopalan et al. (2014).


6. Regret Analysis 

We first define some additional notation that is useful for
our analysis in Section 6.1, then analyze the regret bound
in Section 6.2. The proofs of all the lemmas, except for
Lemma 2, are given in the Appendix.

6.1. Additional notation 

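Let $\mu_L^{(-)} = \mu_L - \delta$ and $\mu_i^{(+)} = \mu_i + \delta$ for $\delta > 0$ and
$i \in [K] \setminus [L]$. We assume $\delta$ to be sufficiently small such
that $\mu_L^{(-)} \in (\mu_{L+1}, \mu_L)$ and $\mu_i^{(+)} \in (\mu_i, \mu_L)$. We also
define $N_i^{\mathrm{suf}}(T) = \log T / d(\mu_i^{(+)}, \mu_L^{(-)})$. Intuitively,
$N_i^{\mathrm{suf}}(T)$ is the sufficient number of explorations to make
sure that arm $i$ is not as good as arm $L$.

Events: Now, let $\max^{(m)}_{i \in S} a_i$ denote the $m$-th largest
element of $\{a_i\}_{i \in S} \in \mathbb{R}^{|S|}$, that is,
$\max^{(m)}_{i \in S} a_i = \max_{S' \subset S : |S'| = m} \min_{i \in S'} a_i$.
We define $\theta^{*}(t) = \max^{(L)}_{i \in [K]} \theta_i(t)$ as the $L$-th
largest posterior sample at round $t$ (i.e., the minimum posterior
sample among the selected arms), and
$\theta^{**}_{\setminus i,j}(t) = \max^{(L-1)}_{k \in [K] \setminus \{i,j\}} \theta_k(t)$ as the
$(L-1)$-th largest posterior sample at round $t$ except for arms $i$
and $j$. Moreover, let $\nu$ = [...]. Let us define the following
events.

$$\begin{aligned}
\mathcal{A}_i(t) &= \{ i \in I(t) \}, \\
\mathcal{B}(t) &= \{ \theta^{*}(t) \ge \mu_L^{(-)} \}, \\
\mathcal{C}_i(t) &= \bigcap_{j \in [K] \setminus ([L-1] \cup \{i\})} \{ \theta^{**}_{\setminus i,j}(t) \ge \nu \}, \\
\mathcal{D}_i(t) &= \{ N_i(t) < N_i^{\mathrm{suf}}(T) \}.
\end{aligned}$$

Event $\mathcal{A}_i(t)$ states that arm $i$ is sampled at round $t$, and
$\mathcal{D}_i(t)$ states that arm $i$ has not been sampled sufficiently
yet. The complements of $\mathcal{B}(t)$ and $\mathcal{C}_i(t)$ are related to the
underestimation of optimal arms. Since the optimal arms
are sampled sufficiently, $\mathcal{B}^c(t)$ or $\mathcal{C}_i^c(t)$ should not occur
very frequently.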

6.2. Proof of Theorem 1

We first decompose the regret to the contribution of each
arm. Recall that the regret increase by drawing suboptimal
arm $i$ is determined by the optimal arm excluded in the
selection set $I(t)$. Formally, for suboptimal arm $i$, let

$$\Delta_i(t) = \begin{cases} \max_{j \in [L] \setminus I(t)} \mu_j - \mu_i & \text{if } I(t) \neq [L], \\ 0 & \text{otherwise,} \end{cases} \tag{10}$$

and

$$\mathrm{Reg}_i(T) = \sum_{t=1}^{T} \mathbf{1}\{ i \in I(t) \}\, \Delta_i(t).$$

From inequality (5) the following inequality is easily
derived:

$$\mathrm{Reg}(T) \le \sum_{i \in [K] \setminus [L]} \mathrm{Reg}_i(T).$$

We next decompose $\mathrm{Reg}_i(T)$ into several terms by using
events $\mathcal{A}$-$\mathcal{D}$. After giving bounds for these terms, we
finally give the total regret bound, which proves Theorem 1.
Note that, in bounding the deviation of Bernoulli means
and Beta posteriors in the Appendix, our analysis borrowed
some techniques developed in the context of the SP-MAB
problem, mostly from Agrawal & Goyal (2013b), and some
from Honda & Takemura (2014).

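Lemma 2. The regret by drawing suboptimal arm $i > L$ is
decomposed as:

$$\begin{aligned}
\mathrm{Reg}_i(T) \le{}& \underbrace{\sum_{t=1}^{T} \mathbf{1}\{ \mathcal{B}^c(t) \}}_{\text{(A)}}
+ \underbrace{\sum_{t=1}^{T} \mathbf{1}\{ \mathcal{A}_i(t), \mathcal{C}_i^c(t) \}}_{\text{(B)}} \\
&+ \underbrace{\sum_{j \in [K] \setminus ([L-1] \cup \{i\})} \sum_{t=1}^{T} \mathbf{1}\{ \mathcal{A}_i(t), \mathcal{A}_j(t), \mathcal{C}_i(t), \mathcal{D}_i(t) \}}_{\text{(C)}} \\
&+ \underbrace{\sum_{t=1}^{T} \mathbf{1}\{ \mathcal{A}_i(t), \mathcal{B}(t), \mathcal{D}_i^c(t) \}}_{\text{(D)}}
+ N_i^{\mathrm{suf}}(T)\, \Delta_{i,L},
\end{aligned}$$

where, for example, $\{\mathcal{A}, \mathcal{B}\}$ abbreviates $\{\mathcal{A} \cap \mathcal{B}\}$.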

Roughly speaking,

• Term (A) corresponds to the case in which some of
the optimal arms are underestimated.

• Term (B) corresponds to the case in which arm $i$ is
selected and some of the arms in $[L-1]$ are
underestimated.

• Term (C) corresponds to the case in which arms
$i \in [K] \setminus [L]$ and $j \in [K] \setminus ([L-1] \cup \{i\})$ are
simultaneously drawn. In particular, term (C) is unique to
the MP-MAB problem and causes an additional regret
increase, and in analyzing this term we fully use the
fact that the samples of the posterior distributions on
the arms are independent of each other.

• Term (D) corresponds to the case in which arm $i$ is
selected after it is sufficiently explored.

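Proof of Lemma 2. The contribution of suboptimal arm $i$
to the regret is decomposed as follows. By using the fact
$\Delta_i(t) \le 1$ and the following decomposition of an event

$$\begin{aligned}
\mathcal{A}_i(t) &\subset \mathcal{B}^c(t) \cup \{ \mathcal{A}_i(t), \mathcal{C}_i^c(t) \} \cup \{ \mathcal{A}_i(t), \mathcal{B}(t), \mathcal{C}_i(t) \} \\
&\subset \mathcal{B}^c(t) \cup \{ \mathcal{A}_i(t), \mathcal{C}_i^c(t) \}
\cup \{ \mathcal{A}_i(t), \mathcal{B}(t), \mathcal{D}_i^c(t) \} \cup \{ \mathcal{A}_i(t), \mathcal{C}_i(t), \mathcal{D}_i(t) \},
\end{aligned}$$

we have

$$\begin{aligned}
\mathrm{Reg}_i(T) \le{}& \sum_{t=1}^{T} \mathbf{1}\{ \mathcal{B}^c(t) \}
+ \sum_{t=1}^{T} \mathbf{1}\{ \mathcal{A}_i(t), \mathcal{C}_i^c(t) \}
+ \sum_{t=1}^{T} \mathbf{1}\{ \mathcal{A}_i(t), \mathcal{B}(t), \mathcal{D}_i^c(t) \} \\
&+ \sum_{t=1}^{T} \mathbf{1}\{ \mathcal{A}_i(t), \mathcal{C}_i(t), \mathcal{D}_i(t) \}\, \Delta_i(t).
\end{aligned} \tag{11}$$

Recall that $\Delta_i(t)$ is defined as (10). At each round, when
arm $L$ and all suboptimal arms, except for $i$, are not selected,
then $I(t) = \{1, 2, \dots, L-1, i\}$ and $\Delta_i(t) = \Delta_{i,L}$. Therefore,

$$\begin{aligned}
\sum_{t=1}^{T} \mathbf{1}\{ \mathcal{A}_i(t), \mathcal{C}_i(t), \mathcal{D}_i(t) \}\, \Delta_i(t)
\le{}& \sum_{t=1}^{T} \mathbf{1}\Bigl\{ \mathcal{A}_i(t), \mathcal{C}_i(t), \mathcal{D}_i(t), \bigcap_{j \in [K] \setminus ([L-1] \cup \{i\})} \mathcal{A}_j^c(t) \Bigr\}\, \Delta_{i,L} \\
&+ \sum_{t=1}^{T} \mathbf{1}\Bigl\{ \mathcal{A}_i(t), \mathcal{C}_i(t), \mathcal{D}_i(t), \bigcup_{j \in [K] \setminus ([L-1] \cup \{i\})} \mathcal{A}_j(t) \Bigr\} \\
\le{}& \sum_{t=1}^{T} \mathbf{1}\{ \mathcal{A}_i(t), \mathcal{D}_i(t) \}\, \Delta_{i,L}
+ \sum_{j \in [K] \setminus ([L-1] \cup \{i\})} \sum_{t=1}^{T} \mathbf{1}\{ \mathcal{A}_i(t), \mathcal{A}_j(t), \mathcal{C}_i(t), \mathcal{D}_i(t) \} \\
\le{}& N_i^{\mathrm{suf}}(T)\, \Delta_{i,L}
+ \sum_{j \in [K] \setminus ([L-1] \cup \{i\})} \sum_{t=1}^{T} \mathbf{1}\{ \mathcal{A}_i(t), \mathcal{A}_j(t), \mathcal{C}_i(t), \mathcal{D}_i(t) \}.
\end{aligned} \tag{12}$$

Summarizing (11) and (12) completes the proof. □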


Optimal Regret Analysis of Thompson Sampling in Stochastic Multi-armed Bandit Problem with Multiple Plays

The following lemma bounds terms (A)-(D).

Lemma 3. (Bounds on individual terms) Let $\epsilon_2 > 0$ be
arbitrary. For sufficiently small $\delta$ and $\epsilon_2$, the four terms
are bounded in expectation as:

$$\mathbb{E}[\text{(A)}] = O\Biggl( \frac{1}{(\mu_L - \mu_L^{(-)})^2} \Biggr), \tag{13}$$

$$\mathbb{E}[\text{(B)}] = O(\log\log T), \tag{14}$$

$$\mathbb{E}[\text{(C)}] \le \sum_{j \in [K] \setminus ([L-1] \cup \{i\})} [\dots], \tag{15}$$

and

$$\mathbb{E}[\text{(D)}] \le 2 + [\dots]. \tag{16}$$

The proof of Lemma 3 is in Appendix A.4. Lemma 3 states
that terms (A), (B), and (D) are $O(1/\delta^2)$. Moreover, the
following lemma bounds term (C).

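Lemma 4. (Asymptotic convergence of $\epsilon_2$-dependent
factor) By choosing an $O((\log\log T)/\log T)$ value of $\epsilon_2$, we
obtain $\mathbb{E}[\text{(C)}] = O(\log\log T)$.

The proof of Lemma 4 is in Appendix A.5. Now it
suffices to evaluate
$N_i^{\mathrm{suf}}(T) = \log T / d(\mu_i^{(+)}, \mu_L^{(-)})$ to complete the
proof. From the convexity of KL divergence there exists a
constant $c_i = c_i(\mu_i, \mu_L) > 0$ such that

$$d(\mu_i^{(+)}, \mu_L^{(-)}) = d(\mu_i + \delta, \mu_L - \delta) \ge (1 - c_i \delta)\, d(\mu_i, \mu_L)$$

and therefore

$$\begin{aligned}
\mathbb{E}[\mathrm{Reg}(T)] &\le \sum_{i \in [K] \setminus [L]} \mathbb{E}[\mathrm{Reg}_i(T)] \\
&\le \sum_{i \in [K] \setminus [L]} \bigl\{ \mathbb{E}[\text{(A)} + \text{(B)} + \text{(C)} + \text{(D)}] + N_i^{\mathrm{suf}}(T)\, \Delta_{i,L} \bigr\} \\
&\le \sum_{i \in [K] \setminus [L]} \underbrace{\frac{\Delta_{i,L} \log T}{(1 - c_i \delta)\, d(\mu_i, \mu_L)}}_{\text{main term}} + C_a + C_b.
\end{aligned}$$

Since $(1 - c_i \delta)^{-1} \le 1 + 2 c_i \delta$ for $c_i \delta \le 1/2$, we
complete the proof of Theorem 1 by letting $\epsilon_1 < 1/2$ and
$\delta = \epsilon_1 / \max_{i \in [K] \setminus [L]} c_i = O(\epsilon_1)$. □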
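The elementary inequality used in the last step can be verified in one line. The following is a worked check, writing $x = c_i \delta$ for brevity.

```latex
\[
\frac{1}{1 - x} \;=\; 1 + \frac{x}{1 - x} \;\le\; 1 + 2x
\qquad \text{for } 0 \le x \le \tfrac{1}{2},
\]
```

since $1 - x \ge 1/2$ on this range implies $x/(1-x) \le 2x$.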

7. Experiment 

We ran a series of computer simulations¹ to clarify the
empirical properties of MP-TS. The simulations involved the
following three scenarios. In Scenarios 1 and 2, we used
fixed arms similar to those of Garivier & Cappé (2011), and
Scenario 3 is based on a click log dataset of advertisements
on a commercial search engine.

Algorithms: the simulations involved MP-TS, Exp3.M
(Uchiya et al., 2010), CUCB (Chen et al., 2013), and
MP-KL-UCB. Exp3.M is a state-of-the-art adversarial
bandit algorithm for the MP-MAB problem². The
learning rate $\gamma$ of Exp3.M is set in accordance with
Corollary 1 of Uchiya et al. (2010). Note that the
CUCB algorithm in the MP-MAB problem at each
round draws the top-$L$ arms of the UCB indices
$\hat{\mu}_i + \sqrt{(3 \log t)/(2 N_i(t))}$. MP-KL-UCB is the algorithm
that selects the top-$L$ arms in accordance with the KL-UCB
index $\sup_{q \in [\hat{\mu}_i(t), 1]} \{ q \mid N_i(t)\, d(\hat{\mu}_i(t), q) \le \log t \}$.

¹The source code of the simulations is available at
https://github.com/jkomiyama/multiplaybanditlib.

Scenario 1 (5-armed bandits): the simulations
include 5 Bernoulli arms with
$\{\mu_1, \dots, \mu_5\} = \{0.7, 0.6, 0.5, 0.4, 0.3\}$, and $L = 2$.

Scenario 2 (20-armed bandits): the simulations
include 20 Bernoulli arms with $\mu_1 = 0.15$, $\mu_2 = 0.12$,
$\mu_3 = 0.10$, $\mu_i = 0.05$ for $i \in \{4, 5, \dots, 12\}$, $\mu_i = 0.03$ for
$i \in \{13, 14, \dots, 20\}$, and $L = 3$.

Scenario 3 (many-armed bandits, online advertisement
based CTRs): we conducted another set of experiments
with arms whose expectations were based on the dataset
provided for KDD Cup³ 2012 track 2. The dataset involves
a click log on soso.com (a large-scale search engine
serviced by Tencent), which is composed of 149 million
impressions (views of advertisements). We processed the
data as follows. First, we excluded users of abnormally high
click probability (i.e., users who had more than 1,000
impressions and more than 0.1 click probability) from the
log. We also excluded minor advertisements (ads) that had
fewer than 5,000 impressions. There are a wide variety of ads
on a search engine (e.g., “rental cars”, “music”, etc.) and
randomly picking ads from a search engine should yield a
set of irrelevant ads. To address this issue, we selected
popular queries that had more than $10^4$ impressions and more
than 50 ads that appeared on the query. As a result, 80
queries were obtained. The number of ads associated with
each query ranged from 50 to 105, and the average
click-through-rate (CTR, the probability that the ad is
clicked) of an ad on each query ranged from 1.15% to 6.86%.
After that, each ad was converted into a Bernoulli arm with
its expectation corresponding to the CTR of the ad. At the
beginning of each run, one of the queries was randomly
selected, and the bandit simulation with the arms
corresponding to the query and $L = 3$ was then conducted.
This scenario was more difficult than the first two scenarios
in the sense that 1) a larger number of arms were involved
and 2) the reward gap among arms was very small.


The simulation results are shown in Figure 2. In all
scenarios, the tendency is the same: our proposed MP-TS
performs significantly better than the other algorithms.
MP-KL-UCB is not as good as MP-TS, but clearly better than
CUCB and Exp3.M. While it is unclear whether the slope
of the regret of MP-KL-UCB converges to the asymptotic
bound or not, the slope of the regret of TS quickly
approaches the asymptotic lower bound.

²Note that Exp3.M is designed for the adversarial setting in
which the rewards of arms are not necessarily stationary.

³https://www.kddcup2012.org/

[Figure 2: panels (a) Scenario 1, (b) Scenario 2, (c) Scenario 3]

Figure 2. Regret-round plots of algorithms. The regrets in Scenarios 1 and 2 are averaged over 10,000 runs, and the regret in Scenario 3
is averaged over 1,000 runs. “Lower Bound” is the leading $\Omega(\log T)$ term of the RHS of inequality (7). We do not show Lower Bound
in Scenario 3 because the coefficient of the bound can sometimes be quite large (i.e., in some runs, $1/d(\mu_{L+1}, \mu_L)$ is large).

7.1. Improvement of MP-TS based on the empirical means

Figure 3. Before/after comparison of MP-TS. All settings (except
for algorithms) are the same as those of Scenario 3.

We now introduce an improved version of MP-TS (IMP-TS).
In the theoretical analysis of the MP-MAB problem,
we observed that an extra loss arises when multiple
suboptimal arms are drawn at the same round. Based on this
observation, the new algorithm selects $L - 1$ arms on the
basis of empirical averages and selects the last arm on the
basis of TS to avoid simultaneous draws of suboptimal arms.
In other words, this algorithm further aims to minimize
the regret by purely exploiting the knowledge in the
top-$(L-1)$ arms, thus limiting the exploration to only one
arm. One might fear that this increase in exploitation could
devastate the balance between exploration and exploitation.
Although we provide no regret bound for the improved
version of the algorithm, we expect that this algorithm will
also achieve the asymptotic bound for the following reason.
When we restrict the exploration to one arm, the number of
opportunities for an arm to be explored may decrease, say,
from $T$ to $T/L$. Still, $T/L$ opportunities are sufficient since
$O(\log(T/L)) = O(\log T)$. In fact, the algorithm proposed
by Anantharam et al. (1987) achieves the asymptotic bound
even though $L - 1$ arms are selected based on empirical
means as in IMP-TS. Similarly, we define an improved
version of MP-KL-UCB (IMP-KL-UCB) that selects the first
$L - 1$ arms on the basis of empirical averages. The
before/after analysis of this improvement is shown in Figure
3. One sees that (i) MP-TS still performs better than
IMP-KL-UCB, and (ii) IMP-TS reduces the regret throughout
the rounds. In particular, when the number of rounds is
small ($T \sim 10^3$ to $10^4$), the advantage of IMP-TS is large.
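The selection step of IMP-TS described above might be sketched as follows. The paper does not spell out details such as tie-breaking or the empirical mean of an arm that has never been drawn, so those choices, along with the function name `imp_ts_select`, are assumptions made purely for illustration.

```python
import random


def imp_ts_select(A, B, L, rng):
    """Pick L arms: L - 1 by empirical mean, the last one by a posterior sample.

    A[i] - 1 and B[i] - 1 are the observed successes and failures of arm i
    (Beta(1, 1) prior, as in Algorithm 1).
    """
    K = len(A)

    def empirical_mean(i):
        n = A[i] + B[i] - 2
        # Assumption: an arm with no draws gets empirical mean 0.
        return (A[i] - 1) / n if n > 0 else 0.0

    # Exploit: the L - 1 arms with the largest empirical means.
    exploit = sorted(range(K), key=empirical_mean, reverse=True)[:L - 1]
    # Explore: among the remaining arms, the one with the largest posterior sample.
    rest = [i for i in range(K) if i not in exploit]
    theta = {i: rng.betavariate(A[i], B[i]) for i in rest}
    explore = max(rest, key=lambda i: theta[i])
    return exploit + [explore]


sel = imp_ts_select([50, 30, 10, 2, 1], [10, 30, 50, 3, 1], L=3, rng=random.Random(0))
print(sel)
```

Only one slot per round is filled by a random posterior sample, so at most one arm is chosen for exploration in any round, which is the mechanism intended to avoid simultaneous draws of suboptimal arms.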

8. Discussion 

We extended TS to the multiple-play setting and proved its 
optimality in terms of the regret. We considered the case in 
which the total reward is linear to the individual rewards of 
selected arms. The analysis in this paper fully uses the in¬ 
dependent property of posterior samples and paves the way 
to obtain a tight analysis on the multiple-play regret that de¬ 
pends on the combinatorial structure of arm selection. We 
now point out two promising directions for future work. 


Position-dependent factors for online advertising:

it is well-known that the CTR of an ad is dependent
on its position. Taking the position-dependent factor
into consideration changes the MP-MAB problem
from the L-set selection problem to the L-sequence
selection problem in which the position of L arms
matters. For the starting point, we consider an extension
of MP-TS for the cascade model (Kempe &
Mahdian, 2008; Aggarwal et al., 2008) that corrects
position-dependent bias in Appendix A.2.


Non-Bernoulli distributions for general problems:

for the ease of argument, we exclusively consider the
binary rewards. The analysis by Korda et al. (2013)
is useful in extending our result to the case of the 1-d
exponential families of rewards. Moreover, extending
our result to multi-parameter reward distributions
(Burnetas & Katehakis, 1996; Honda & Takemura,
2014) is interesting.
A. Appendix

A.1. Cases of several arms having the same expectation

Up to now, we have assumed that all arms have distinct expectations.
Here, we consider cases in which some arms
have the same expectations. Without loss of generality,
we assume $\mu_1 \ge \mu_2 \ge \dots \ge \mu_K$. Let us call arms
with a larger expectation than $\mu_L$ “strictly optimal” arms,
arms with the same expectation as $\mu_L$ “marginal” arms, and
arms with a smaller expectation than $\mu_L$ “strictly suboptimal”
arms. Each arm is either strictly optimal, marginal, or
strictly suboptimal.
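
The three-way classification can be made concrete with a short sketch. It assumes that the comparison point is the L-th largest expectation (written `threshold` below); the function name `classify_arms` is hypothetical and used only for illustration.

```python
def classify_arms(means, num_plays):
    """Label each arm relative to the L-th largest expectation (L = num_plays)."""
    ranked = sorted(means, reverse=True)
    threshold = ranked[num_plays - 1]  # expectation of the L-th best arm
    labels = []
    for mu in means:
        if mu > threshold:
            labels.append("strictly optimal")
        elif mu == threshold:
            labels.append("marginal")
        else:
            labels.append("strictly suboptimal")
    return labels


# Two arms tie at the L-th position (L = 2), so both are marginal.
labels = classify_arms([0.9, 0.5, 0.5, 0.2], num_plays=2)
```

In this example there are several marginal arms, which is the situation of Case 3 below.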

Case 1: Assume that all strictly optimal arms are distinct, 
that there is only one marginal arm, and that there are sev¬ 
eral strictly suboptimal arms with the same expectation. In 
this case, the regret bound of Theorem[T]holds because our 
analysis deals with each suboptimal arm separately. 

Case 2: Assume that there is only one marginal arm, that
all strictly suboptimal arms are distinct, and that there are 
several strictly optimal arms with the same expectation. 
The regret bound also holds in this case since there is a 
gap between each strictly suboptimal arm and each strictly 
optimal arm. 

Case 3: Assume that all strictly optimal arms and strictly 
suboptimal arms are distinct and that there are several 
marginal arms with the same expectation. Unfortunately,
we were unable to perform a meaningful analysis in
this case. Intuitively, as stated by Agrawal & Goyal
(2012) for SP-MAB, adding an additional
marginal arm appears to require some extra exploration,
which slightly increases the regret. However, the regret
structure is more complex than in the SP-MAB problem because
several marginal arms can be drawn simultaneously.

In summary, our Theorem [T] holds when the marginal arm 
is distinct. That is, $\mu_1 \ge \mu_2 \ge \dots \ge \mu_{L-1} > \mu_L >
\mu_{L+1} \ge \dots \ge \mu_K$.




A.2. Cascade model and position-dependent MP-MAB 
problem 

In the main paper, we assumed that the rewards of arms are 
independently and identically drawn from individual dis¬ 
tributions. In this section, we relax this assumption and 
consider a wider class of the MP-MAB problem. Remember
that one of our primary applications is multiple advertisement
placement in the online advertising problem (cf.
Example 1). In this section, we interchangeably use the
terms an advertisement (ad) and an arm. It is known that 
the CTR of an ad depends on the environment where the 
ad is placed, especially on the position of the ad. Among 
several models that explain this dependency on the posi¬ 
tion, the model that explains human behavior and agrees 


Algorithm 2 Bias-Corrected Multiple-play Thompson
sampling (BC-MP-TS) for binary rewards

Input: # of arms $K$, # of positions $L$, discount factors $\{\gamma_l(i)\}$
for $i = 1, 2, \dots, K$ do
    $A_i, N_i = 1, 2$
end for
$t \leftarrow 1$.
for $t = 1, 2, \dots, T$ do
    for $i = 1, 2, \dots, K$ do
        $B_i \leftarrow \max(N_i - A_i, 1)$
        $\theta_i(t) \sim \mathrm{Beta}(A_i, B_i)$
    end for
    Select $I_l(t)$ ($l = 1, \dots, L$) in accordance with Section A.2.2
    for $l \in 1, 2, \dots, L$ do
        if $X_{I_l(t)}(t) = 1$ then
            $A_{I_l(t)} \leftarrow A_{I_l(t)} + 1$
        end if
        $N_{I_l(t)} \leftarrow N_{I_l(t)} + \prod_{l'=2}^{l} \gamma_{l'}(I_{l'-1}(t))$
    end for
end for


well with real data (Craswell et al., 2008) is the cascade
model (Kempe & Mahdian, 2008; Aggarwal et al., 2008),
with which it is assumed that the user scans the ads from
top to bottom. Following Gatti et al. (2012), we define
the discount factor $\gamma_l(i)$ for $l \ge 2$ as the probability that
a user observing ad $i$ in position $l - 1$ will observe the
ad in the next position. Namely, the MP-MAB problem
with a discount factor is defined as an MP-MAB problem
in which the arm at position $l$ yields reward 1 with probability
$\left(\prod_{l'=2}^{l} \gamma_{l'}(I_{l'-1}(t))\right) \mu_{I_l(t)}$, where $I_l(t)$ is the arm
placed at the $l$-th position at round $t$. Note that, when we
set $\gamma_l(i) = 1$ for any position $l \in [L]$ and ad $i$, this model
reduces to the model we considered in the main paper. In
the MP-MAB problem in the main paper, the order of the L
arms does not matter, whereas under a position-dependent
discount factor smaller than 1, the order of the L arms matters:
the problem is not the selection of an L-set of arms, but of an
L-sequence of arms.
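
The reward probability of each position under the cascade model can be computed as in the following sketch. The function name `position_reward_probs`, the 0-indexed arm identifiers, and the callable `discount(l, i)` standing for the discount factor of ad i at position l are assumptions made here for illustration.

```python
def position_reward_probs(arms, means, discount):
    """Reward probability of each position for a given sequence of arms.

    arms:     arm indices placed at positions 1, ..., L (top to bottom)
    means:    expected reward of each arm when it is observed
    discount: discount(l, i), probability that a user who observed ad i at
              position l - 1 also observes position l (defined for l >= 2)
    """
    probs = []
    observe = 1.0  # probability that the user reaches the current position
    for pos, arm in enumerate(arms, start=1):
        if pos >= 2:
            previous_arm = arms[pos - 2]  # arm placed at position pos - 1
            observe *= discount(pos, previous_arm)
        probs.append(observe * means[arm])
    return probs


# Constant discount 0.7: the three positions are observed w.p. 1, 0.7, 0.49.
probs = position_reward_probs([0, 1, 2], [0.24, 0.21, 0.18], lambda l, i: 0.7)
```

With `discount` returning 1 everywhere, the output is simply the list of expectations of the selected arms, which recovers the setting of the main paper.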


A.2.1. Thompson sampling for cascade model 

In the cascade model, there is some probability that the arm
at position $l > 1$ is not drawn. The probability that the
arm at position $l$ is drawn, $\prod_{l'=2}^{l} \gamma_{l'}(I_{l'-1}(t))$, can be considered
as the effective number of draws at position $l$.
MP-TS (Algorithm 1) keeps $A_i$ and $B_i$, which respectively
correspond to the number of rewards 1 and 0. The number
of draws on the arm $i$ is $N_i = A_i + B_i$. When we consider
the cascade model, we need to take the effective number of
draws into consideration. We introduce Bias-corrected MP-TS
(BC-MP-TS, Algorithm 2). The crux of BC-MP-TS is
that, for each arm that is selected, $N_i$ should be increased
not by 1, but by the effective number of draws for its position.
Note that, when $\gamma_l(i) = 1$, BC-MP-TS is essentially
the same as MP-TS.


A.2.2. Optimal arm selection and the regret 


For a general discount factor $\gamma_l(i)$, even if we have perfect
information over the expectation of all arms $\{\mu_i\}$, the
computation of the optimal sequence of $L$ arms at each
round $t$ (optimal arm selection) appears to be computationally
intractable when $K$ is large because we need to search
all the possible allocations of $K$ ads over $L$ positions. In the
case where $\gamma_l(i) = \gamma(i)$, Kempe & Mahdian (2008) proposed
a polynomial-time approximation of the optimal arm
selection. We can obtain the arm selection strategy for BC-MP-TS
by using this approximation algorithm as an oracle
and plugging in $\{\theta_i(t)\}$ as estimated expected rewards.


Ad-independent discount factor: when the discount factor
is independent of the ad at that position (i.e., $\gamma_l(i) =
\gamma_l$), the optimal arm selection is easy: just select $\mu_l$ (i.e., the
$l$-th best arm) on the $l$-th position. We define the arm selection
strategy of BC-MP-TS as placing the arm of the $l$-th
largest $\theta_i$ (i.e., $I_l(t) = \arg\max^{(l)}_{i \in [K]} \theta_i$) on the $l$-th position.
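
Putting the pieces together, BC-MP-TS with an ad-independent discount factor can be simulated as in the sketch below. It is an illustration under stated assumptions, not the authors' code: the function name `bc_mp_ts`, the list `gammas` (where `gammas[l]` is the probability of moving from 0-indexed position `l` to `l + 1`), the initial values A_i = 1 and N_i = 2 (a uniform prior), and the way the simulated user scans the list are all choices made here.

```python
import random


def bc_mp_ts(means, num_plays, gammas, horizon, seed=0):
    """Simulate BC-MP-TS on Bernoulli arms with ad-independent discounts."""
    rng = random.Random(seed)
    num_arms = len(means)
    a = [1.0] * num_arms  # A_i: 1 + number of observed rewards (assumed init)
    n = [2.0] * num_arms  # N_i: 2 + effective number of draws (assumed init)
    total_reward = 0
    for _ in range(horizon):
        # Posterior samples with the bias-corrected parameter B_i = max(N_i - A_i, 1).
        theta = [rng.betavariate(a[i], max(n[i] - a[i], 1.0)) for i in range(num_arms)]
        # Place the arm with the l-th largest sample on the l-th position.
        order = sorted(range(num_arms), key=lambda i: theta[i], reverse=True)[:num_plays]
        effective = 1.0  # probability that the current position is observed
        seen = True      # whether the simulated user is still scanning
        for pos, arm in enumerate(order):
            if pos >= 1:
                effective *= gammas[pos - 1]
                seen = seen and rng.random() < gammas[pos - 1]
            if seen and rng.random() < means[arm]:
                a[arm] += 1.0
                total_reward += 1
            # Bias correction: count an effective draw, not a full draw.
            n[arm] += effective
    return total_reward, a, n


arm_means = [round(0.24 - 0.03 * i, 2) for i in range(9)]
reward, a_counts, n_counts = bc_mp_ts(arm_means, 3, [0.7, 0.7], horizon=2000, seed=1)
```

Setting every entry of `gammas` to 1 makes each effective draw equal to 1, in which case the update coincides with that of plain MP-TS.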

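Regret: naturally, the regret per round is defined as the
difference between the expected reward of the optimal arm
selection and that of an algorithm. Namely,

$$\mathrm{Reg}(T) = \sum_{t=1}^{T} \sum_{l=1}^{L} \Bigg( \prod_{l'=2}^{l} \gamma_{l'}\big(I^{\mathrm{opt}}_{l'-1}(t)\big)\, \mu_{I^{\mathrm{opt}}_{l}(t)} - \underbrace{\prod_{l'=2}^{l} \gamma_{l'}\big(I_{l'-1}(t)\big)}_{\text{effective number of draws at position } l} \mu_{I_l(t)} \Bigg),$$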



Figure 4. Simulation with a discount factor. Lower Bound is the 
leading fl(logT) term of the RHS of inequality l[7](, which we 
have conjectured to be the lower bound for the cascade model 
with the ad-independent discount factor in Section A.2.2. The
regret is averaged over 10,000 runs. 

A.2.3. Experiment of cascade model 

This simulation adopts the cascade model and involves a
constant discount factor $\gamma_l(i) = 0.7$ for any position and
arm. There are 9 Bernoulli arms with $\mu_1 = 0.24, \mu_2 =
0.21, \dots, \mu_9 = 0.00$ and $L = 3$. In this case the optimal
arm selection strategy is to choose $\{I_1(t), I_2(t), I_3(t)\} =
\{\mu_1, \mu_2, \mu_3\}$ (cf. Section A.2.2). The regret of the algorithms
is shown in Figure 4. On one hand, MP-TS failed to
achieve a small regret because it ignores the discount factors.
On the other hand, the slope of BC-MP-TS quickly
approaches the conjectured Lower Bound, which is empirical
evidence of the ability of BC-MP-TS to correct the
position-dependent bias.
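
The claim that the best three arms in decreasing order form the optimal sequence in this setting can be checked by brute force. The sketch below assumes that the nine expectations are evenly spaced from 0.24 down to 0.00 and uses hypothetical helper names; arms are 0-indexed.

```python
from itertools import permutations


def expected_reward(order, means, gamma):
    """Expected reward per round of a sequence under a constant discount factor."""
    return sum((gamma ** pos) * means[arm] for pos, arm in enumerate(order))


means = [round(0.24 - 0.03 * i, 2) for i in range(9)]  # assumed even spacing
gamma = 0.7

# Enumerate all 9 * 8 * 7 = 504 sequences of three distinct arms.
best = max(permutations(range(9), 3), key=lambda o: expected_reward(o, means, gamma))
optimal_value = expected_reward(best, means, gamma)

# Per-round regret of swapping the two best arms.
swap_regret = optimal_value - expected_reward((1, 0, 2), means, gamma)
```

The optimal sequence yields 0.24 + 0.7 × 0.21 + 0.49 × 0.18 = 0.4752 per round, and swapping the first two arms loses 0.009 per round, so the order of the selected arms does affect the regret.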

A.3. Key fact and lemmas 

Fact 5. (Chernoff bound for binary random variables)

Let $X_1, \dots, X_n$ be i.i.d. binary random variables. Let $\bar{X} =
\frac{1}{n} \sum_{i=1}^{n} X_i$ and $\mu = E[X_i]$. Then, for any $\epsilon \in (0, 1 - \mu)$,

$$\Pr(\bar{X} \ge \mu + \epsilon) \le \exp(-d(\mu + \epsilon, \mu)\, n),$$


where $(I^{\mathrm{opt}}_1(t), \dots, I^{\mathrm{opt}}_L(t))$ is the optimal arm selection.
In the case of the ad-independent discount factor, we con¬ 
jecture that the regret lower bound should be identical to 
the case of no-discount factor that we analysed in the main 
paper (i.e., inequality (0). Although we do not prove any 
regret bound for this cascade model, the conjecture is supported
by the fact that (i) by identifying the top-L arms
we immediately obtain the optimal arm selection, (ii) algorithms
should require $\log T / d(\mu_i, \mu_L)$ effective
draws to be convinced that suboptimal arm $i > L$ is not as good
as arm $L$, and (iii) the best situation is that the simultaneous
draw of several optimal arms rarely occurs: arm $L$ is
pushed out instead of arm $i$, and the regret increase per
effective draw is $\mu_L - \mu_i$. In the case of the general discount
factor, the problem is subtler because a slight difference in
$\{\mu_i\}$ can change the optimal arm selection.


and, for any $\epsilon \in (0, \mu)$,

$$\Pr(\bar{X} \le \mu - \epsilon) \le \exp(-d(\mu - \epsilon, \mu)\, n).$$

Fact 6. (Beta-Binomial equality) Let $F^{\mathrm{beta}}_{\alpha,\beta}(y)$ be the cdf
of the beta distribution with integer parameters $\alpha$ and $\beta$.
Let $F^{B}_{n,p}(\cdot)$ be the cdf of the binomial distribution with parameters
$n$, $p$. Then,

$$F^{\mathrm{beta}}_{\alpha,\beta}(y) = 1 - F^{B}_{\alpha+\beta-1,\,y}(\alpha - 1).$$

Fact 7. (Pinsker’s inequality for binary random variables)
For $p, q \in (0,1)$, the KL divergence between two Bernoulli
distributions is bounded as:

$$d(p, q) \ge 2(p - q)^2.$$
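
The three facts can be sanity-checked numerically. The sketch below uses only the standard library; the helper names and the specific test values (n = 50, μ = 0.3, and so on) are arbitrary choices for illustration, and the Beta cdf is obtained by a simple midpoint-rule integration so that the check does not rely on the identity being verified.

```python
import math


def bernoulli_kl(p, q):
    """d(p, q): KL divergence between Bernoulli(p) and Bernoulli(q), 0 < q < 1."""
    kl = 0.0
    if p > 0.0:
        kl += p * math.log(p / q)
    if p < 1.0:
        kl += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return kl


def binomial_cdf(k, n, p):
    """Pr(Binomial(n, p) <= k)."""
    return sum(math.comb(n, j) * p**j * (1.0 - p) ** (n - j) for j in range(k + 1))


def beta_cdf(y, a, b, steps=20000):
    """Beta(a, b) cdf at y for integers a, b >= 1, by midpoint-rule integration."""
    norm = math.factorial(a + b - 1) / (math.factorial(a - 1) * math.factorial(b - 1))
    h = y / steps
    area = sum(((i + 0.5) * h) ** (a - 1) * (1.0 - (i + 0.5) * h) ** (b - 1)
               for i in range(steps))
    return norm * area * h


# Fact 5 (upper tail): Pr(mean >= k/n) <= exp(-d(k/n, mu) n) for k/n > mu.
n, mu, k = 50, 0.3, 20
upper_tail = 1.0 - binomial_cdf(k - 1, n, mu)
chernoff_bound = math.exp(-bernoulli_kl(k / n, mu) * n)

# Fact 6: Beta cdf expressed through a binomial cdf.
alpha, beta, y = 3, 5, 0.3
beta_side = beta_cdf(y, alpha, beta)
binomial_side = 1.0 - binomial_cdf(alpha - 1, alpha + beta - 1, y)

# Fact 7: Pinsker's inequality on a grid of (p, q) pairs.
grid = [i / 20 for i in range(1, 20)]
pinsker_holds = all(bernoulli_kl(p, q) >= 2 * (p - q) ** 2 - 1e-12
                    for p in grid for q in grid)
```

These checks only probe a few parameter values and are no substitute for the proofs, but they help catch index errors when the facts are applied, for example the shift by one in the parameters of Fact 6.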
Lemma 8. (Lemma 2 in Agrawal & Goyal (2013b)) Let
$k \in [K]$, $n \ge 0$ and $x < \mu_k$. Let $\hat{\mu}_{k,n}$ be the empirical
average of $n$ samples from Bernoulli($\mu_k$). Let
$P_{k,n}(x) = 1 - F^{\mathrm{beta}}_{\hat{\mu}_{k,n} n + 1,\, (1 - \hat{\mu}_{k,n}) n + 1}(x)$ be the probability
that the posterior sample from the Beta distribution with its
parameters $\hat{\mu}_{k,n} n + 1$, $(1 - \hat{\mu}_{k,n}) n + 1$ exceeds $x$. Then, its
average over runs is bounded as:

$$E\left[\frac{1}{P_{k,n}(x)}\right] \le \begin{cases} 1 + \dfrac{3}{\Delta_k(x)} & (n < 8/\Delta_k(x)) \\[2ex] 1 + \Theta\left( e^{-\Delta_k(x)^2 n/2} + \dfrac{1}{(n+1)\Delta_k(x)^2}\, e^{-D_k(x) n} + \dfrac{1}{e^{\Delta_k(x)^2 n/4} - 1} \right) & (n \ge 8/\Delta_k(x)), \end{cases}$$

where $\Delta_k(x) = \mu_k - x$, $D_k(x) = d(x, \mu_k)$.

In the proof of Lemma we use the following Lemmas 
9, 10, and 11 several times. Lemma 9 is essentially the
combination of the existing techniques of Agrawal & Goyal
(2013b) and Honda & Takemura (2014). Lemmas 10 and
11 are also existing techniques that appear in several previous
analyses in Bayesian bandits with Bernoulli arms.

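Lemma 9. Let $k \in [K]$ and $z < \mu_k$ be arbitrary, and let $\mathcal{S}(t)$, $\mathcal{T}(t)$,
and $\mathcal{U}(t)$ be events such that

(i) if $\{\theta_k(t) > z\}$, $\mathcal{S}(t)$, and $\mathcal{T}(t)$ occurred then the arm
$k$ is drawn at round $t$,

(ii) $\theta_k(t)$, $\mathcal{S}(t)$, and $\mathcal{T}(t)$ are mutually independent given
$\{\hat{\mu}_i(t)\}_{i=1}^{K}$ and $\{N_i(t)\}_{i=1}^{K}$,

(iii) the event $\mathcal{U}(t)$ is deterministic given $\{\hat{\mu}_i(t)\}_{i=1}^{K}$ and
$\{N_i(t)\}_{i=1}^{K}$, and

(iv) given $\{\hat{\mu}_i(t)\}_{i=1}^{K}$ and $\{N_i(t)\}_{i=1}^{K}$ such that $\mathcal{U}(t)$
holds, $\mathcal{T}(t)$ occurs with probability at least $q > 0$.

Then

$$E\left[\sum_{t=1}^{T} 1\{\theta_k(t) \le z, \mathcal{S}(t), \mathcal{U}(t), N_k(t) < N_c\}\right] = O\left(\frac{1}{q(\mu_k - z)^2} + N_c \frac{1-q}{q}\right).$$

In particular, by setting $\mathcal{T}(t)$ and $\mathcal{U}(t)$ the trivial events
that always hold ($q = 1$), we obtain the following inequality:

$$E\left[\sum_{t=1}^{T} 1\{\theta_k(t) \le z, \mathcal{S}(t)\}\right] = O\left(\frac{1}{(\mu_k - z)^2}\right). \quad (17)$$

Proof. First we have

$$\sum_{t=1}^{T} 1\{\theta_k(t) \le z, \mathcal{S}(t), \mathcal{U}(t), N_k(t) < N_c\} \le \sum_{n=0}^{N_c} \sum_{t=1}^{T} 1\{\theta_k(t) \le z, \mathcal{S}(t), \mathcal{U}(t), N_k(t) = n\}$$
$$= \sum_{n=0}^{N_c} \sum_{m=1}^{T} 1\left\{ m \le \sum_{t=1}^{T} 1\{\theta_k(t) \le z, \mathcal{S}(t), \mathcal{U}(t), N_k(t) = n\} \right\}. \quad (18)$$

Here note that the event

$$\left\{ m \le \sum_{t=1}^{T} 1\{\theta_k(t) \le z, \mathcal{S}(t), \mathcal{U}(t), N_k(t) = n\} \right\}$$

implies that the event

$$\{\mathcal{S}(t), \mathcal{U}(t), N_k(t) = n\} \quad (19)$$

occurred for at least $m$ rounds and $\{\theta_k(t) \le z\}$ or $\mathcal{T}^c(t)$
occurred for the first $m$ rounds such that (19) occurred.
Thus, by using the mutual independence of $\{\theta_k(t) \le z\}$,
$\mathcal{S}(t)$, and $\mathcal{T}(t)$, we have

$$\Pr\left[ m \le \sum_{t=1}^{T} 1\{\theta_k(t) \le z, \mathcal{S}(t), \mathcal{U}(t), N_k(t) = n\} \,\middle|\, \hat{\mu}_{k,n} \right] \le (1 - P_{k,n}(z)\, q)^m \quad (20)$$

and therefore

$$E\left[\sum_{t=1}^{T} 1\{\theta_k(t) \le z, \mathcal{S}(t), \mathcal{U}(t), N_k(t) < N_c\}\right] \le E\left[\sum_{n=0}^{N_c} \sum_{m=1}^{T} (1 - P_{k,n}(z)\, q)^m\right] \quad \text{(by (20))}$$
$$\le E\left[\sum_{n=0}^{N_c} \left(\frac{1}{P_{k,n}(z)\, q} - 1\right)\right] = E\left[\sum_{n=0}^{N_c} \left(\frac{1}{q}\left(\frac{1}{P_{k,n}(z)} - 1\right) + \frac{1-q}{q}\right)\right].$$

By using Lemma 8 we obtain

$$E\left[\sum_{n=0}^{T-1} \left(\frac{1}{P_{k,n}(z)} - 1\right)\right] \le \frac{24}{\Delta_k(z)^2} + \sum_{n=\lceil 8/\Delta_k(z) \rceil}^{T-1} \Theta\left( e^{-\Delta_k(z)^2 n/2} + \frac{e^{-D_k(z) n}}{(n+1)\Delta_k(z)^2} + \frac{1}{e^{\Delta_k(z)^2 n/4} - 1} \right). \quad (21)$$

By using the fact that $D_k(z) = d(z, \mu_k) = \Omega((\mu_k - z)^2)$
(from the Pinsker’s inequality), it is easy to verify that the
RHS of (21) is $O(1/(\mu_k - z)^2)$. By using these facts, we
finally obtain

$$E\left[\sum_{t=1}^{T} 1\{\theta_k(t) \le z, \mathcal{S}(t), \mathcal{U}(t), N_k(t) < N_c\}\right] \le \frac{1}{q} E\left[\sum_{n=0}^{T-1} \left(\frac{1}{P_{k,n}(z)} - 1\right)\right] + (N_c + 1) \frac{1-q}{q}$$
$$= O\left(\frac{1}{q(\mu_k - z)^2} + N_c \frac{1-q}{q}\right),$$

which concludes the proof of the lemma.

□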


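Lemma 10. (Deviation of empirical averages, Agrawal &
Goyal (2013b, Appendix B.1)) Let $k \in [K]$ and $z > \mu_k$ be
arbitrary. Then,

$$E\left[\sum_{t=1}^{T} 1\{A_k(t), \hat{\mu}_k(t) > z\}\right] \le 1 + \frac{1}{d(z, \mu_k)}.$$

Lemma 11. (Deviation of Beta posteriors) Let $k \in [K]$,
$x_1, x_2 \in [0,1]$ be arbitrary values such that $x_1 > x_2$, and
$n \ge 1$. Then,

$$\Pr(\theta_k(t) > x_1 \mid \hat{\mu}_k(t) \le x_2, N_k(t) = n) \le \exp(-d(x_2, x_1)\, n).$$

Proof. Note that this lemma is essentially the same as
the first display in Agrawal & Goyal (2013b, Appendix
B.2). While Agrawal & Goyal (2013b) provide a bound
for $N_k(t) > n$, the bound in our lemma is for $N_k(t) = n$.
For the sake of rigor, we write the proof here.

$$\Pr(\theta_k(t) > x_1 \mid \hat{\mu}_k(t) \le x_2, N_k(t) = n)$$
$$= \Pr\big(\theta \sim \mathrm{Beta}(\hat{\mu}_k(t) n + 1, (1 - \hat{\mu}_k(t)) n + 1),\ \theta > x_1 \,\big|\, \hat{\mu}_k(t) \le x_2\big)$$
$$\le 1 - F^{\mathrm{beta}}_{x_2 n + 1,\, (1 - x_2) n + 1}(x_1)$$
$$= F^{B}_{n+1,\, x_1}(x_2 n) \quad \text{(by the Beta-Binomial equality)}$$
$$\le \exp(-d(x_2, x_1)\, n) \quad \text{(by the Chernoff bound)}.$$

□

A.4. Proof of Lemma 12
Evaluation of term (A):

Proof. Here, we prove inequality (13). Recall that

$$\text{(A)} = \sum_{t=1}^{T} 1\{\mathcal{B}^c(t)\} = \sum_{t=1}^{T} 1\{\theta^*(t) < \mu_L^{(-)}\}.$$

Since $\theta^*(t)$ is the $L$-th largest posterior sample among arms
at round $t$, $\theta^*(t) < \mu_L^{(-)}$ implies that there exists at least
one arm in $[L]$ with its posterior sample smaller than $\mu_L^{(-)}$.
Namely,

$$\{\theta^*(t) < \mu_L^{(-)}\} \subset \bigcup_{k \in [L]} \{\theta_k(t) < \mu_L^{(-)}\}$$

and therefore

$$\{\theta^*(t) < \mu_L^{(-)}\} = \bigcup_{k \in [L]} \{\theta_k(t) < \mu_L^{(-)}, \theta^*(t) < \mu_L^{(-)}\}$$
$$\subset \bigcup_{k \in [L]} \Big\{\theta_k(t) < \mu_L^{(-)}, \max^{(L)}_{j \in [K] \setminus \{k\}} \theta_j(t) < \mu_L^{(-)}\Big\}.$$

By using the union bound, we obtain

$$1\{\theta^*(t) < \mu_L^{(-)}\} \le \sum_{k \in [L]} 1\Big\{\theta_k(t) < \mu_L^{(-)}, \max^{(L)}_{j \in [K] \setminus \{k\}} \theta_j(t) < \mu_L^{(-)}\Big\}.$$

Note that the event $\{\max^{(L)}_{j \in [K] \setminus \{k\}} \theta_j(t) < \mu_L^{(-)}\}$ satisfies the
condition for the event $\mathcal{S}(t)$ in (17) in Lemma 9 with $z :=
\mu_L^{(-)}$. Therefore we obtain from Lemma 9 that

$$E\left[\sum_{t=1}^{T} 1\{\theta^*(t) < \mu_L^{(-)}\}\right] \le \sum_{k \in [L]} O\left(\frac{1}{(\mu_k - \mu_L^{(-)})^2}\right) = O\left(\frac{1}{(\mu_L - \mu_L^{(-)})^2}\right),$$

which concludes the proof of inequality (13).

□

Evaluation of term (B):

Proof. Here, we prove inequality (14). We have,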


□ (B) = yi{A(f),c/(f)} 




= Em U 


‘=1 ljem\([i-i]u{*}) 
= E E

T 

= E E 

( 22 ) 

In the following, we bound the first and the second terms
in the inner sum of the last line of (22). From Lemma 10,
the first term of (22) is bounded as




E 


'^l{A^{t),fi,{t) > hl} 


,i=l 


< 1 + 


1 




= 0 ( 1 ). 


On the other hand, the second term of (22) is transformed
as

T 


t=l 


< 


log log T 


+ ^1 N,{t) > 


< 


log log T 

d{p,L,v) 

T 


Since $\theta^{**}_{\setminus i,j}(t)$ is the $(L-1)$-th largest posterior sample
among arms except for $i$ and $j$, $\{\theta^{**}_{\setminus i,j}(t) < \nu\}$ indicates that
the number of arms excluding $i$ and $j$ with posterior samples
larger than or equal to $\nu$ is at most $L - 2$, and thus at
least one arm among $[L - 1]$ has its posterior sample smaller than
$\nu$. Namely,

{P-t* jit) <’^} = { max < z/} 

= [J {Pkit) < V, max < z/} 

^ "'.El ^ 

By using this, we have 

T 

sE E !{«.«)> 


<=1 feG[L-l] 


d{pL,v) 


mK]\{i,3,k} 

Moreover, let $\nu_2 = (\nu + \mu_L)/2 = (\mu_{L-1} + 3\mu_L)/4$. For

$k \in [L - 1]$, $\mu_k > \nu > \nu_2 > \mu_L$ and

Pr{p,(f)<ZA,iV,(f)> ^(^°g^^)J 

T 

< ^ Pr{/ife(f) < v.Nkit) = n} 

log T 

T 

< ^ Pr{/ife(f) < i/, fik{t) > 1 ^ 2 , Nk{t) = n} 


log T 

'2 


-2(^-^2V 


+ ^ Pr{/ifc(f) < V 2 ,Nk{t) = n} 




^ E 


^—d{v2,y)n 


. logT 

2{u-U2y^ 

T 


+ E P^{A/c(A) < V 2 ,Nk{t) = n} 


2(i'-i'2)^ 


(by Lemma [TT) 




-d(v2,p.k)n 


— log T 


(by Chernoff bound) 

= 0(1/T) (by (/ife — z/ 2 ) > {v — i' 2 ) and Pinsker’s inequality) 
and thus 

T 


E E 


> 


log log T 


d{nL,i') ’ 

maj 

ie[K]\{^,j,k} 


i=l feG[L-l] 

pi{t) < HL,Pk{t) < V, max < v) 

ig\k]\UJM j 

T 

^E E Pr{jv,(f) > , Nkit) < 


log log r logT 


t=i feG[L-l] 


d{fj,L,v)’ 2 {v-V2Y' 

pr{t) < fJ.L,Pk(t) < V, max <iy\ 

ig\k]\UJM j 


+ E E Pr{pk(t)<iy,JVk(t)> 

t=i fce[L-i] ^ ' 


< 


E EPfi^-w 


log logT 


logT 


t=l feG[L-l] 


^ J/ ^ ’ ^k(t) ^ nf \2^ 

d{fJ,L,v) 2 (V-V 2 r 


Piifi)<lkL,Pk{t) <v, max Pfl.i{t)<v\ 

+ 0 ( 1 ). 
Here, z := v, S{t) := ^

T{t) := {jiiit) < v), := sat¬ 

isfy the conditions in Lemma Under if (t), T(f) holds 
with probability at least 

1 -(^^)) = 1 - (logr)-' 

by Lemma 0 Therefore, by using Lemma with Nc = 
log T / {2{v — V 2 Y'), we obtain 


By using Lemma 10 with $z := \nu_2$, the first term in (24) is
bounded as:


E 


>V2} 




< 1 + 




= 0 


{v2- 


= o 


(PL_1 - 


= 0 ( 1 ). 

(25) 


E 


1 


log log T 


logT 


> J 

d{fj.L,v) 2{v-V2r 


fj'iit) < piL,idk{t) < ly, max 


< o 


1 


(1 - (logT)-Y(fkk - 

(logT)-i logT 


O 


i-(iogr)-i 2 (z/-i/ 2 )' 


= 0(1). (23) 


From ( |2T| and the union bound over k € [L—1], the second 
term of ( |22| ) is 0(1). In summary, term (B) is 0(log log T) 
in expectation. □ 


Evaluation of term (C): 


Proof. Here, we prove inequality (15). Recall that,


We now bound the second term in LetCU(f) = 

>y}^ Ci(f). Letfj(f) = {NYt) > £2 logT}. 

We have, 

T 

T 

- <V2} 

t^l 

< €2 log T 

T 

A'^HAYt),Aj{t),C[^j(t),Vi{t),fiYt) < v2,£A)}- 

t^l 

iV|^^(T)-l T 

<62 logT-f ^ ^ 

n—0 t—1 

l{A^{t), Aj{t),C'ij{t)Ai{t) = n,jlj{t) < v 2 ,Sj{f)}. 


T 

A)= E Y.^{A{t),AAYA(t),vYt)}. 

Let $\nu_2 = (\nu + \mu_L)/2 = (\mu_{L-1} + 3\mu_L)/4$. Note that
we defined $\nu$ and $\nu_2$ such that $\mu_{L-1} > \nu > \nu_2 > \mu_L$ and
$\Theta(\mu_{L-1} - \nu) = \Theta(\nu - \nu_2) = \Theta(\nu_2 - \mu_L) = \Theta(\mu_{L-1} -
\mu_L) = \Theta(1)$ as a function of $T$. Then,

T 

T 

= E HAi(t),Aj{t),Ci{f),VYt),fijY) > ^2} 

T 

+ E MAi{t),Aj{t),Ci{t),'Di{t),fij{t) < 1^2} 

T 

T 

+ E i{Ai{t),Aj{t),Ci{t),'Di{t),fij{t) < 1/2}. 


In the following, we bound 

T 

^l{Ai{t),Aj{t)Ai,A)Ai{t) =n,ilj{t) < z/ 2 ,fj(f)}. 

(26) 

Note that, ( |26l ) is at most 1 since {Ai{t), Ni(t) = n} oc¬ 
curs at most once. Let t be the first round (if exists) at 
which {C[j{f),e^* .{f) < 0i{t),Ai{t),Ni(t) = n} is sat¬ 
isfied. It is necessary that {6 *j(t) > 6*** ,-(t)| for (|26]l to be 
1: this is because, (i) both Oi (r) and 9j (t) need to be larger 
than 0**j{t) for the simultaneous draw of arms i and j, 
(ii) and if Oj{T) < 9**jir) then arm i is drawn and thus 
{Ni{t) = n} is never satisfied after t > t. Here, 


Pr{6»j(T) > 6 »^U(t),6»^U(t) > v,ilj{T) < V2} 

< eyip{-d{v2,v)Nj{T)), 

by Lemma 11. Therefore, we have


E 


'^l{Ai{f),Aj{f),Ci{t),Ni{t) = < 1^2} 

< exp (-fi(i/ 2 ,i^)e 2 logT) = . (27) 


(24) 
In summary, the second term in (24) is bounded as


is bounded as 


E 






E 


< £2 log T + 

^j^-e2d(v2,v 


< E 
= E 


< £2 


and thus. 




logT (by (1 + 5)^ < 4), 


E[(C)] 

^ E 

G[if]\([L-r]u{*}) 

^ E 

G[i<-]\([L-r]u{i}) 


l[A,{t),Bit),fi,{t) < > Nr\T)] 

'im) > < n^\N,{t) > N^Ht)] 

E[i[0,(t) > < B^\m) > NrHT)] 

N,{t)] 

< E E[l[/l,(f) < > N^Ht)] 


(£2 + logr' 

(£2 + 4r-"^'^'.^-i/®j iogr\ 

difJ.i,fJ-L) j 


0 ( 1 ) 


< E 


E 


0 ( 1 ), 


exp{-d{n'l^\ N,{t) 

(by Lemma [TT]i 

= exp{-d{f,^\f,[-^)NrHT)) 

= T-^ (by the definition of (T)), 


(30) 


where we used the fact that $d(\nu_2, \nu) \ge 2(\nu - \nu_2)^2 = 2 \times
((\mu_{L-1} - \mu_L)/4)^2$ in the last transformation. □


Evaluation of term (D): 


where we used the fact $E[X] = E[E[X|Y]]$ for any random
variables $X$ and $Y$. Putting (28)-(30) together we obtain


E[(D)]<1. 


1 


Proof. Here, we prove inequality (16). We first divide term
(D) into two subterms as:


E[(D)] = E 


< E 


-E 


> /rl+\iv.(f) > iVr'(T)} 

■ T 

> Nr\T)} 


from which the inequality (16) follows.

A.5. Proof of Lemma H] 

It suffices to prove that for any a,b>0 

■ 'ji—ae2 


E^■^ 


(31) 

□ 


inf , 

£2>0 I b 


\ f\og\ogT\ 




(28) 


On one hand, the first term in (28) is bounded as:


E 


By letting $\epsilon_2 = (\log\log T)/(a \log T)$, we have

r rp-a €2 'I |'g-ae 2 logT 

inf —--h £2 > = inf - - -h £2 

e2>0 (0 J e2>0 ( b 

g-log log T log log T 
b a log r 

1 log log T 


< 


> Nr^{T)) 


= O 


b log T a log T 

/ log log T' 


V logP 


< E 

< 1 






and the proof is completed. 
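
For reference, the computation behind this choice of $\epsilon_2$ can be written out step by step. The following is a sketch of the argument for fixed $a, b > 0$ and $T$ large enough that $\log\log T > 0$; it uses only $T^{-a\epsilon_2} = e^{-a \epsilon_2 \log T}$ and the fact that the infimum is at most the value at any particular $\epsilon_2 > 0$.

```latex
\begin{align*}
\inf_{\epsilon_2 > 0} \left\{ \frac{T^{-a\epsilon_2}}{b} + \epsilon_2 \right\}
  &= \inf_{\epsilon_2 > 0} \left\{ \frac{e^{-a\epsilon_2 \log T}}{b} + \epsilon_2 \right\} \\
  &\le \frac{e^{-\log\log T}}{b} + \frac{\log\log T}{a \log T}
     \qquad \left(\text{taking } \epsilon_2 = \frac{\log\log T}{a \log T}\right) \\
  &= \frac{1}{b \log T} + \frac{\log\log T}{a \log T}
   = O\!\left(\frac{\log\log T}{\log T}\right).
\end{align*}
```

The second term dominates the first as $T$ grows, which is why the rate is $\log\log T / \log T$ rather than $1/\log T$.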


Pi) 


(by Lemma 10). (29)


On the other hand, each component of the second term of 












































